
All Questions

1 vote
1 answer
415 views

Getting equal distributions of data from different input sets

I am new to ML. I am trying to create a training dataset that is equally distributed across multiple lists, each of which has a different kind of data. How can I do this? I looked into ...
user81371
0 votes
2 answers
717 views

Dynamic creation of sklearn pipeline

I am trying to create an automatic pipeline builder functionality that takes into account a large set of conditions such as the existence of missing values, the scale of numerical features, etc., and ...
lazarea
1 vote
1 answer
626 views

scikit-learn OneHot returns tuples and not vectors

First I do a label encoding to all the columns that are strings so they will be numeric. After that, I take just the columns with the labels, convert them to np array, reshape, and convert them to one-...
JamseGoldman
1 vote
0 answers
43 views

Should the test dataset be scaled with respect to its distribution or with respect to the distribution of the training dataset? [duplicate]

I have applied data scaling techniques on my training dataset during training. For evaluation, when scaling the test dataset, should it be scaled using the scalers fitted to the training dataset or ...
NIM4
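
For the scaling question above, a minimal sketch of the commonly used convention: fit the scaler on the training data only, then reuse its statistics on the test data. The toy arrays are hypothetical and not taken from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the test values are deliberately shifted
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0], [12.0]])

scaler = StandardScaler()
# Mean and standard deviation are learned from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# The test data is transformed with the training statistics, not refit
X_test_scaled = scaler.transform(X_test)
```

Refitting the scaler on the test set would hide any shift between the two sets and leak test information into preprocessing, which is why this sketch calls only `transform` there.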
0 votes
1 answer
84 views

How should a stateless data transformation be applied in regard to train/test split?

I want to apply spatial sign transformation to my data, but unlike other transformations this one is stateless. I am using sklearn and normally I would first use ...
Mateusz
1 vote
1 answer
980 views

How to impute missing values in the test set using a custom imputer created on the training dataset

I am working on a toy project to predict claims. One of the input features has null values on which I have applied a custom imputation technique. Under this technique, I replaced missing values with ...
tanmay
1 vote
1 answer
199 views

How to control for covariate shift in the test dataset compared to the training data for a regression task?

I am working on a regression project, but I am facing the problem of covariate shift in features due to time delay. Test data was collected a year later, due to which there has been some change in ...
saurabh kumar
8 votes
1 answer
15k views

Encoding with OrdinalEncoder: how to give levels as user input?

I am trying to do ordinal encoding using: from sklearn.preprocessing import OrdinalEncoder I will try to explain my problem with a simple dataset. ...
Ayush Ranjan
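
For the OrdinalEncoder question above, a minimal sketch of supplying the level order explicitly through the `categories` parameter. The column values and their order are hypothetical, since the dataset in the question is not shown here.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical level order, lowest to highest
levels = ["low", "medium", "high"]

# Hypothetical single-column data
X = [["medium"], ["low"], ["high"], ["low"]]

# One list of levels per column; list position becomes the integer code
encoder = OrdinalEncoder(categories=[levels])
encoded = encoder.fit_transform(X)
```

Without `categories`, the encoder orders levels by sorting them, which for strings would put "high" before "low" here.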
2 votes
1 answer
95 views

Is it compulsory to normalize the dataset if doing so can negatively impact the performance of binary logistic regression?

I am using a raw dataset with 4 feature variables to do binomial classification using the logistic regression algorithm. I made sure that the class counts are balanced, i.e., an equal number of ...
GYSHIDO
0 votes
0 answers
262 views

Pre-processing data to make predictions on deployed Sklearn model

I am new to Machine Learning. I have trained an ML model on the Diamond Prices Dataset to predict the price of a diamond given its features (carat, cut, color, clarity, etc.). I have used pickle to ...
Kag Tes
3 votes
1 answer
1k views

Python - Create many dummy variables from one text variable?

I'm trying to create dummy variables for a variable that has text data in rows. Data in 1st row is: ...
Naveen Reddy Marthala
62 votes
4 answers
58k views

Difference between OrdinalEncoder and LabelEncoder

I was going through the official scikit-learn documentation after reading a book on ML and came across the following: In the documentation it is given about ...
Saurabh Singh
51 votes
3 answers
74k views

StandardScaler before or after splitting data - which is better?

When I was reading about using StandardScaler, most of the recommendations were saying that you should use StandardScaler before ...
tsumaranaina
0 votes
1 answer
403 views

What is the best way to normalize histogram vectors to get a distribution?

I have the following sample of 4 vectors of dimension 5. They are sparse vectors, in the sense that each value in a vector represents a frequency (number of occurrences of values). For instance v_1=[0,4,...
Joseph
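
For the histogram question above, a minimal sketch of one common choice, L1 normalization, which divides each count by the vector total so the entries sum to 1. The vector and the helper name `to_distribution` are hypothetical; the question's own vectors are truncated and are not reproduced here.

```python
import numpy as np

def to_distribution(counts):
    """Divide a vector of non-negative counts by its sum (L1 normalization)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        # An all-zero histogram has no defined distribution; return it unchanged
        return counts
    return counts / total

# Hypothetical frequency vector of dimension 5
probs = to_distribution([0, 4, 0, 1, 5])
```

Whether this is the best normalization depends on the downstream use; it is only the simplest one that yields a valid probability vector.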
1 vote
2 answers
404 views

Pre-process data images before training OneClassSVM and decrease number of features

I want to train a OneClassSVM() using sklearn, and I have a set of around 800 images in my training set. I am using OpenCV to read the images and resize them to constant dimensions (960x540) and then ...
riadrifai
